3,647 research outputs found

    Hidden Markov Models for Gene Sequence Classification: Classifying the VSG genes in the Trypanosoma brucei Genome

    The article presents an application of Hidden Markov Models (HMMs) for pattern recognition on genome sequences. We apply HMMs to identify genes encoding the Variant Surface Glycoprotein (VSG) in the genomes of Trypanosoma brucei (T. brucei) and other African trypanosomes. These parasitic protozoa are the causative agents of sleeping sickness and several diseases in domestic and wild animals. These parasites have a peculiar strategy to evade the host's immune system that consists of periodically changing their predominant cellular surface protein (VSG). The motivation for using pattern recognition methods to identify these genes, instead of traditional homology-based ones, is that the levels of sequence identity (amino acid and DNA sequence) amongst these genes are often below what is considered reliable for these methods. Among pattern recognition approaches, HMMs are particularly suitable for this problem because they handle the determination of gene edges more naturally. We evaluate the performance of the model using different numbers of states in the Markov model, as well as several performance metrics. The model is applied using public genomic data. Our empirical results show that the VSG genes of T. brucei can be safely identified (high sensitivity and low rate of false positives) using HMMs. Comment: Accepted article in July, 2015 in Pattern Analysis and Applications, Springer. The article contains 23 pages, 4 figures, 8 tables and 51 references.

    Workflows for the Large-Scale Assessment of miRNA Evolution: Birth and Death of miRNA Genes in Tunicates

    As described over 20 years ago with the discovery of RNA interference (RNAi), double-stranded RNAs occupy key roles in regulation and as a line of defense in animal cells. This thesis focuses on metazoan microRNAs (miRNAs). These small non-coding RNAs are distinguished from their small-interfering RNA (siRNA) relatives by their tightly controlled, efficient and flexible biogenesis, together with a broader flexibility to target multiple mRNAs through imperfect seed base-pairing. As potent regulators, miRNAs are involved in mRNA stability and post-transcriptional regulation, a conserved mechanism used repeatedly by evolution, not only in metazoans but also in plants and unicellular organisms. A comprehensive review of the current animal miRNA model shows that the canonical pathway dominates the extensive literature about miRNAs and has served as a scaffold for understanding the regulatory landscape of the cell. The characterization of a diverse set of non-canonical pathways has expanded this view, suggesting a diverse, rich and flexible regulatory landscape to generate mature miRNAs. The production of miRNAs, derived from isolated or clustered transcripts, is an efficient and highly conserved mechanism traced back to animals with high fidelity at family level. In evolutionary terms, expansions of miRNA families have been associated with increasing morphological and developmental complexity. In particular, the Chordata clade (the ancient cephalochordates, the highly derived and secondarily simplified tunicates, and the well-known vertebrates) represents an interesting scenario to study miRNA evolution. Despite clearly conserved miRNAs along these clades, tunicates display massive restructuring events, including the emergence of highly derived miRNAs. As shown in this thesis, model-organism or vertebrate-specific biases exist in current animal miRNA annotations, misrepresenting more diverse groups, such as marine invertebrates. 
Current miRNA databases, such as miRBase and Rfam, classify miRNAs under different definitions and hold annotations that are not simple to link. As an alternative, this thesis proposes a method to curate and merge those annotations, making use of miRBase precursor/mature annotations and genomes together with Rfam predicted sequences. This approach generated structural models for shared miRNA families, based on the alignment of their correctly positioned mature sequences as anchors. In this process, the developed structural curation steps flagged 33 miRNA families from Rfam as questionable. Curated Rfam and miRBase anchored-structural alignments provided a rich resource for constructing predictive miRNA profiles, using corresponding hidden Markov models (HMMs) and covariance models (CMs). Applied directly, those models are time-consuming to use, and the user has to deal with multiple iterations to achieve a genome-wide non-overlapping annotation. To resolve this, the proposed miRNAture pipeline provides an automatic and flexible solution to annotate miRNAs. It combines multiple homology approaches to generate the best candidates validated at sequence and structural levels. This increases the achievable sensitivity to annotate canonical miRNAs, and the evaluation against human annotation shows that clear false positive calls are rare and additional counterparts lie in retained introns, transcribed lncRNAs or repeat families. Further development of miRNAture suggests an inclusion of multiple rules to distinguish non-canonical miRNA families. This thesis describes multiple homology approaches to annotate the genomic information from a non-model chordate: the colonial tunicate Didemnum vexillum. High levels of genetic variance and unexpected levels of DNA degradation were detected through a comprehensive analysis of genome-assembly methods and gene annotation. 
Despite those challenges, it was possible to find candidate homeobox and skeletogenesis-related genes. On its own, the ncRNA annotation included expected conserved families, and an extensive search of the Rhabdomyosarcoma 2-associated transcript (RMST) lncRNA family traced it back to the divergence of deuterostomes. In addition, a complete study of the annotation thresholds suggested variations to detect miRNAs, later implemented in the miRNAture tool. This chapter is a showcase of the usual workflow that a comprehensive sequencing, assembly and annotation project should follow, in light of the increasing volume of research based on DNA sequencing. In the last 10 years, the remarkable increase in tunicate sequencing projects boosted access to an expanded miRNA annotation landscape. In this way, a comprehensive homology approach annotated the miRNA complement of 28 deuterostome genomes (including the 16 currently reported tunicates) using miRNAture. To get proper structural models as input, corrected miRBase structural alignments served as a scaffold for building corresponding CMs, based on a developed genetic algorithm. By this means, this automatic approach selected the set of sequences that composed the alignments, generating 2492 miRNA CMs. Despite the multiple sources and associated heterogeneity of the studied genomes, a clustering approach successfully gathered five groups of similar assemblies and highlighted low-quality assemblies. The overall family and loci reduction in tunicates is notable, showing on average 374 microRNA (miRNA) loci, in comparison to other clades: Cephalochordata (2119), Vertebrata (3638), Hemichordata (1092) and Echinodermata (2737). Detection of 533 miRNA families at the divergence of tunicates shows an expanded landscape relative to currently annotated miRNA families. Shared sets of ancestral, chordate, Olfactores, and clade-specific miRNAs were uncovered using phylogenetic conservation criteria. 
Compared to current annotations, the family repertoires were expanded in all cases. Finally, relying on the adjacent elements from annotated miRNAs, this thesis proposes an additional syntenic support to cluster miRNA loci. In this way, the structural alignment of miR-1497, originally annotated in three model tunicates, was expanded with clear syntenic support in tunicates.

    Genome sequence of Ensifer arboris strain LMG 14919T: a microsymbiont of the legume Prosopis chilensis growing in Kosti, Sudan

    Ensifer arboris LMG 14919T is an aerobic, motile, Gram-negative, non-spore-forming rod that can exist as a soil saprophyte or as a legume microsymbiont of several species of legume trees. LMG 14919T was isolated in 1987 from a nodule recovered from the roots of the tree Prosopis chilensis growing in Kosti, Sudan. LMG 14919T is highly effective at fixing nitrogen with P. chilensis (Chilean mesquite) and Acacia senegal (gum Arabic tree or gum acacia). LMG 14919T does not nodulate the tree Leucaena leucocephala, nor the herbaceous species Macroptilium atropurpureum, Trifolium pratense, Medicago sativa, Lotus corniculatus and Galega orientalis. Here we describe the features of E. arboris LMG 14919T, together with genome sequence information and its annotation. The 6,850,303 bp high-quality-draft genome is arranged into 7 scaffolds of 12 contigs containing 6,461 protein-coding genes and 84 RNA-only encoding genes, and is one of 100 rhizobial genomes sequenced as part of the DOE Joint Genome Institute 2010 Genomic Encyclopedia for Bacteria and Archaea-Root Nodule Bacteria (GEBA-RNB) project.

    Genome sequence of the Ornithopus/Lupinus-nodulating Bradyrhizobium sp. strain WSM471

    Bradyrhizobium sp. strain WSM471 is an aerobic, motile, Gram-negative, non-spore-forming rod that was isolated from an effective nitrogen (N2) fixing root nodule formed on the annual legume Ornithopus pinnatus (Miller) Druce growing at Oyster Harbour, Albany district, Western Australia in 1982. This strain is in commercial production as an inoculant for Lupinus and Ornithopus. Here we describe the features of Bradyrhizobium sp. strain WSM471, together with genome sequence information and annotation. The 7,784,016 bp high-quality-draft genome is arranged in 1 scaffold of 2 contigs, contains 7,372 protein-coding genes and 58 RNA-only encoding genes, and is one of 20 rhizobial genomes sequenced as part of the DOE Joint Genome Institute 2010 Community Sequencing Program.

    Phylogenetic signal and the utility of 12S and 16S mtDNA in frog phylogeny

    Genes selected for a phylogenetic study need to contain conserved information that reflects the phylogenetic history at the specific taxonomic level of interest. Mitochondrial ribosomal genes have been used for a wide range of phylogenetic questions in general and in anuran systematics in particular. We checked the plausibility of phylogenetic reconstructions in anurans that were built from commonly used 12S and 16S rRNA gene sequences. For up to 27 species arranged in taxon sets of graded inclusiveness, we inferred phylogenetic hypotheses based on different a priori decisions, i.e. choice of alignment method and alignment parameters, including/excluding variable sites, choice of reconstruction algorithm and models of evolution. Alignment methods and parameters, as well as taxon sampling, all had notable effects on the results, leading to a large number of conflicting topologies. Very few nodes were supported in all of the analyses. Data sets in which fast-evolving and ambiguously aligned sites had been excluded performed worse than the complete data sets. There was moderate support for the monophyly of the Discoglossidae, Pelobatoidea, Pelobatidae and Pipidae. The clade Neobatrachia was robustly supported, and the intrageneric relationships within Bombina and Discoglossus were well resolved, indicating the usefulness of the genes for relatively recent phylogenetic events. Although 12S and 16S rRNA genes seem to carry some phylogenetic signal of deep (Mesozoic) splitting events, the signal was not strong enough to consistently resolve the inter-relationships of major clades within the Anura under varied methods and parameter settings.

    Determining and comparing protein function in Bacterial genome sequences


    Killing Two Birds with One Stone: The Concurrent Development of the Novel Alignment Free Tree Building Method, Scrawkov-Phy, and the Extensible Phyloinformatics Utility, EMU-Phy.

    Many components of phylogenetic inference belong to the most computationally challenging and complex domain of problems. To further escalate the challenge, the genomics revolution has exponentially increased the amount of data available for analysis. This, combined with the foundational nature of phylogenetic analysis, has prompted the development of novel methods for managing and analyzing phylogenomic data, as well as improving or intelligently utilizing current ones. In this study, a novel alignment-free tree-building algorithm using Quasi-Hidden Markov Models (QHMMs), Scrawkov-Phy, is introduced. Additionally, exploratory work in the design and implementation of an extensible phyloinformatics tool, EMU-Phy, is described. Lastly, features of the best-practice tools are inspected and provisionally incorporated into Scrawkov-Phy to evaluate the algorithm’s suitability for said features. This study shows that Scrawkov-Phy, as utilized through EMU-Phy, captures phylogenetic signal and reconstructs reasonable phylogenies without the need for multiple-sequence alignment or high-order statistical models. There are numerous additions to both Scrawkov-Phy and EMU-Phy which would improve their efficacy, and the results of the provisional study show that such additions are compatible.

    Upcoming challenges for multiple sequence alignment methods in the high-throughput era

    This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches